Re-analysis of the microarray data published by Murat, A. et al. (2008) “Stem Cell–Related “Self-Renewal” Signature and High Epidermal Growth Factor Receptor Expression Associated With Resistance to Concomitant Chemoradiotherapy in Glioblastoma”

Álvaro Abella Bascarán (alvaro.abella01@estudiant.upf.edu)

Samuel Miravet Verde (samuel.miravet01@estudiant.upf.edu)

Introduction

Glioblastoma multiforme is the most presented and aggressive brain tumor in humans, involving glial cells, with an incidence of 2–3 cases per 100,000 person life-years in Europe and North America (Fonnet E. Bleeker, et al (2012)).

Treatment can involve chemotherapy, radiation and surgery. Median survival with standard-of-care radiation and chemotherapy with the alkylating agent temozolomide is 15 months(Johnson, Derek R. et al (2011)) while the median survival without treatment is 4 and a half months. Regretfully glioblastomas are notorious for resistance to therapy, which has been attributed to DNA-repair proficiency, a multitude of deregulated molecular pathways, and, more recently, to the particular biologic behavior of tumor stem-like cells. Here, based on the reference work of Murat, A. et al (2008), we aimed to identify the molecular profiles specific for treatment resistance to the current standard of care of concomitant chemoradiotherapy with temozolomide.

To achieve our goal, we take from the reference study a set of gene expression profiles of 80 glioblastomas of patients treated whithin clinical trials of concomitant and adjuvant temozolomide to radiotherapy (n=52) and patients treated with only radioterapy (n=28). In addition, 4 control patients were added to the study.

This document should be processed from R and you need to install the packages knitr and markdown. Once they are installed, you have to type the following instructions that generate a HTML document that you can open with a web browser:

library(knitr)     ## required for "knitting" from Rmd to md
library(markdown)  ## required for processing from md to HTML
knit("projectTemplate.Rmd", "projectTemplate.md")  ## process Rmd to md
markdownToHTML("projectTemplate.md", "projectTemplate.html") ## process md to HTML
browseURL("projectTemplate.html") ## open the resulting HTML file from R

Data

The microarray data used during the research of Murat, A. et al (2008) can be found at the Gene Expression Omnibus, with the accession number GSE7696. In order to proceed with this analysis, the file GSE7696_RAW.tar must be uncompressed under the directory ‘data/raw/’ and the file GSE7696_series_matrix.txt.gz must be placed inside ‘data/’.

Subsampling

The original data contained 84 samples, of which 4 are controls, 28 correspond to patients treated with radiotherapy, and 52 to patients treated with both TMZ and radiotherapy. In order to have a more manageable analysis we decided to proceed, by now, with only 21 samples. The following script was used to do a subsampling keeping the original proportions among controls, TMZ/radiotherapy and radiotherapy samples.

We need GEOquery to retrieve the assay data and affy to handle data from Affymetrix 3’-biased Arrays.

library(affy)
library(GEOquery)

This is a simple way to cache the full eset (all the assay data) in a simple object, to avoid loading it from the .txt.gz the next time (a costly process).

# Cache the data to avoid having to load it again next time from the .txt
if (file.exists('full_eset.rds')) {
  eset = readRDS('full_eset.rds')
} else {
  eset = getGEO(filename='data/GSE7696_series_matrix.txt.gz')
  saveRDS(eset, 'full_eset.rds')
}
## File stored at: 
## /tmp/RtmpR0EH1w/GPL570.soft

We separate the sample names by treatment group, and take enough samples from each group to have a total of 21, keeping the original proportions.

# extract sample names
samples = colnames(eset)

# separate the samples by treatment group
no_treated = samples[eset$characteristics_ch1.5 == "treatment: NA"]
tmz_rt = samples[eset$characteristics_ch1.5 == "treatment: TMZ/radiotherapy"]
rt = samples[eset$characteristics_ch1.5 == "treatment: radiotherapy"]

# take 21 samples, keeping the proportions of the original dataset
set.seed(123456)
no_treated = sample(no_treated, size=1)
tmz_rt = sample(tmz_rt, size=13)
rt = sample(rt, size=7)

Join the three arrays of sample names, and turn them into file names. Then read only the necessary CEL files. This allows us to retrieve the data much faster, as we are loading only the 21 required samples, and not the whole set of 84 samples.

subsample_names = c(no_treated, tmz_rt, rt)
# turn sample names into filenames
subsample_filenames = paste(subsample_names, '.cel.gz', sep='')

# read only the necessary the expression data
glioData = ReadAffy(celfile.path = "./data/raw", filenames=subsample_filenames)

sampleNames(glioData) = letters[1:21]
# display used samples:
subsample_names
##  [1] "GSM187180" "GSM187213" "GSM187178" "GSM187175" "GSM187232"
##  [6] "GSM187163" "GSM187188" "GSM187158" "GSM187222" "GSM187161"
## [11] "GSM187206" "GSM187184" "GSM187211" "GSM187208" "GSM187234"
## [16] "GSM187229" "GSM187225" "GSM187186" "GSM187192" "GSM187214"
## [21] "GSM187185"

We also filter the original assay data to keep only the data corresponding to the samples we have choosen. We finally save both objects for posterior use.

subsampled_eset = eset[, subsample_names]

# save the subsampled data for posterior use

if (!file.exists('eset.rds')) {
  saveRDS(subsampled_eset, 'eset.rds')
}
if (!file.exists('glioData.rds')) {
  saveRDS(glioData, 'glioData.rds')
}

eset = subsampled_eset

Quality assessment

library(affyPLM)
## Loading required package: gcrma
## Loading required package: preprocessCore

Chip inspection

We plot, in the first place, the chip images in order to detect artifacts and discard those samples corresponding to malfunctioning chips.

# Scanner plot analysis
par(mfrow = c(3, 7), mar = c(1, 1, 3, 1))
image(glioData)

From these images we cannot see any obvious artifact, and so far we can consider all of the samples as valid.

Analysis of intensities

We can further assess the quality of the samples analysing the distribution of intensities from each chip. Chips with a distribution which differs significantly from the rest (in median or dispersion) should be taken with caution.

A first step is to check the distribution of raw intensities in logarithmic scale:

# Raw intensity boxplot
par(mar = c(4, 5, 0, 1))
boxplot(glioData)
## 

From the previous boxplots we see that every sample follows a similar distribution. Sample number “u” appears to have a slightly narrower and displaced distribution, but not enough to consider it wrong.

We can get some more information (eg. bimodalities) by means of a non-parametric density estimation of raw intensity values:

# Density plot
par(mar = c(3, 3, 4, 1))
plotDensity.AffyBatch(glioData, lwd=2, col=1:21, lty=1:21)
legend('topright', LETTERS[1:21], col=1:21, lty=1:21, lwd=2, inset=0.1)

We see that all samples follow a similar distribution except in the case of “u”, which is shifted to the right and has a higher peak.

From the two previous diagnostics, we can consider sample “u” as wrong in terms of the distribution of intensities:

badSamplesRawDist = "u"

Probe level modeling

We can assess the quality of Affymetrix chips using a linear model which relates the log intensities to the probe affinity effects, the array effects and the gaussian noise. Once we have the model we can represent the raw intensities (left), residuals (middle) and weights (right).

Pset = fitPLM(glioData)
for (i in 1:21) {
  par(mfrow=c(1, 3))
  image(glioData[, i]) # raw intensities for sample A
  image(Pset, type="resids", which=i) # PLM resids for sample A
  image(Pset, type="weights", which=i) # PLM weights for sample A
}

From the previous figures we can see that sample “j” appears to have a more intense spot, maybe not enough to discard it if the rest of analysis seem to be alright.

badSamplesPLMResids = "j"

Normalized Unscaled Standard Errors (NUSE)

By means of the NUSE we examine the median and interquartile range of all probesets in the chip, to obtain an overall view of chip expression quality.

NUSE(Pset)

In this case we can take as valid those samples which don’t deviate from the rest by more than a 5%. As “u” is above 1.05, we can consider it as deviating from the rest.

badSamplesNUSE = "u"

Relative Log Expression (RLE)

Relative Log Expression Values are calculated for each of the probesets. In tis case we compare the expression on each array against the median expression value for the probeset across all of the arrays. The RLE summaries can be helpful to detect technical sources of variability that are large compared to biological variation.

# RLE
RLE(Pset)

All the samples display a similar median and interquartile range, with none of them exhibiting a very noticable deviation.

Besides the graphical inspection, we can also take a look at the median and interquantile range for each sample.

nuseDiag = NUSE(Pset, type='stats')
rleDiag = RLE(Pset, type='stats')
nuseDiag
##                 a          b          c          d         e          f
## median 1.02595822 0.99504661 1.01100108 1.00000000 0.9952775 1.00000000
## IQR    0.05764787 0.02651335 0.03858302 0.03152226 0.0282349 0.03399814
##                g          h         i          j          k          l
## median 1.0064602 0.99416968 0.9881506 0.99656328 1.01446358 1.00000000
## IQR    0.0371061 0.02546117 0.0222437 0.02737273 0.04226841 0.03162657
##                 m          n          o          p         q          r
## median 0.99150479 0.98863220 0.99417242 0.99065853 1.0221653 1.00000000
## IQR    0.02316854 0.02178958 0.02572843 0.02447591 0.0555395 0.03705234
##                s          t          u
## median 0.9973382 1.00000000 1.05437298
## IQR    0.0288322 0.03142741 0.07408849
rleDiag
##                   a           b         c            d           e
## median 0.0003659262 0.003572683 0.0000000 -0.004528611 -0.03432415
## IQR    0.4225988435 0.206909864 0.3015538  0.278456890  0.26681410
##                  f         g         h        i          j          k
## median 0.005887643 0.0000000 0.0000000 0.000000 0.01271614 0.03564668
## IQR    0.269635063 0.2900608 0.2067449 0.189758 0.23702907 0.31602571
##                  l            m         n           o         p          q
## median 0.001490737 -0.005064451 0.0000000 -0.03389845 0.0000000 0.05960175
## IQR    0.258603185  0.235332232 0.1994644  0.27184003 0.2272568 0.35984732
##                  r         s         t            u
## median -0.01387586 0.0000000 0.0000000 -0.002642683
## IQR     0.27873066 0.2433667 0.2759281  0.370749152

As in the case of the graphical inspection, we cannot see any sample which deviates significantly from the rest.

badSamplesRLE = c()

Normalization

In order to normalize the Affymetrix expression data we resort to the Robust Multi-array Average (RMA) algorithm. This method integrates the background correction, between-array normalization and summary.

# RMA normalization (Robust Multi Array)
rmaEset = rma(glioData)
## Background correcting
## Normalizing
## Calculating Expression

MA plots

The MA plot can help us detect intensity dependent biases. In this case we compare (using a log ratio) the intensity of the red and green channels of the microarray (each representing a different condition). As we expect most of the genes to not be differentially expressed, most of the log ratios should be close to zero. Plotting the log ratio against the mean log intensity of both channels, we can detect dependences of the ratios on the fluorescence intensity.

In this case, we expect the red line to remain close to the blue one. A deviation from it is suggesting a systematic dependence of the ratios on the intensity, indicating poor quality of the expression data.

par(mfrow=c(3, 7), mar=c(1, 1, 1, 1))
MAplot(rmaEset[-grep('^AFFX', featureNames(rmaEset))], plot.method='smoothScatter', cex=0.75, ref.title='')

In this case we don’t see any case extremely obvious, but being strict we can consider as wrong the following set of samples:

badSamplesMA = c("c", "d", "o")

We can finally display a table showing the number of diagnostics failed by each sample:

qaDiag <- data.frame(RawDist = rep(FALSE, ncol(eset)), PLMresids = rep(FALSE, ncol(eset)), NUSE = rep(FALSE, ncol(eset)), RLE = rep(FALSE, ncol(eset)), MA = rep(FALSE, ncol(eset)), Failed = rep(0, ncol(eset)), row.names = sampleNames(glioData))

qaDiag[badSamplesRawDist, "RawDist"] <- TRUE
qaDiag[badSamplesPLMResids, "PLMresids"] <- TRUE
qaDiag[badSamplesNUSE, "NUSE"] <- TRUE
qaDiag[badSamplesRLE, "RLE"] <- TRUE
qaDiag[badSamplesMA, "MA"] <- TRUE

qaDiag$Failed <- rowSums(qaDiag)
qaDiag <- qaDiag[order(qaDiag$Failed, decreasing = TRUE), ]
qaDiag
##   RawDist PLMresids  NUSE   RLE    MA Failed
## u    TRUE     FALSE  TRUE FALSE FALSE      2
## c   FALSE     FALSE FALSE FALSE  TRUE      1
## d   FALSE     FALSE FALSE FALSE  TRUE      1
## j   FALSE      TRUE FALSE FALSE FALSE      1
## o   FALSE     FALSE FALSE FALSE  TRUE      1
## a   FALSE     FALSE FALSE FALSE FALSE      0
## b   FALSE     FALSE FALSE FALSE FALSE      0
## e   FALSE     FALSE FALSE FALSE FALSE      0
## f   FALSE     FALSE FALSE FALSE FALSE      0
## g   FALSE     FALSE FALSE FALSE FALSE      0
## h   FALSE     FALSE FALSE FALSE FALSE      0
## i   FALSE     FALSE FALSE FALSE FALSE      0
## k   FALSE     FALSE FALSE FALSE FALSE      0
## l   FALSE     FALSE FALSE FALSE FALSE      0
## m   FALSE     FALSE FALSE FALSE FALSE      0
## n   FALSE     FALSE FALSE FALSE FALSE      0
## p   FALSE     FALSE FALSE FALSE FALSE      0
## q   FALSE     FALSE FALSE FALSE FALSE      0
## r   FALSE     FALSE FALSE FALSE FALSE      0
## s   FALSE     FALSE FALSE FALSE FALSE      0
## t   FALSE     FALSE FALSE FALSE FALSE      0

In general we have achieved very good results as none of the samples fail more than 2 tests so we decide to continue the study with all the samples. For future analysis considering the whole data set (>80) we could repeat this process and take only the best samples.

Batch identification

The main goal of this step consists in finding if there exist any possible batch effect in our dataset, that is technical sources of variation that have been added to the samples during handling (Leek, et al (2010)). It is important to detect them in order to ensure that the possible future differences between expressions do not come from a non-biological or scientific variables source.

We start the process loading the following libraries.

library(affy)
library(Biobase)
library(GEOquery)
library(affyPLM)
library(corpcor)

Between all the different features our dataset have, the only possible batch source is the experiment date of each sample so this will be our batch grouping variable. To consider it, we have to set a scanDate variable in the correct format:

scanDate = protocolData(glioData)$ScanDate
scanDate = gsub(" .*", "", scanDate)
scanDate = as.Date(scanDate, "%m/%d/%y")

Once defined that variable, we are already able to group together the samples obtained in the same date:

minscan = min(scanDate)
days = scanDate- minscan

sort(days)
## Time differences in days
##  [1]   0   0   2   2   2   2   3   3   3   3   3   7   7   7   7   7  10
## [18]  10  10  10 192

With those different day groups we can discretize the days numeric variable into desired intervals to finally obtain our surrogate batch indicator variable. In addition, the length of the variable days allows us to define a vector of colors that will be used in the posterior plots.

batch_days = data.frame(row.names=sort(unique(days)), batch=c(1:length(unique(days))))

ourcolors = rainbow(length(unique(days)))

The differential outcome for our dataset is the survival time as the aim of the study is to define a set of genes expressed differentially between resistant cancer patients and non-resistant. To consider it, we extract the variable from the eset and transform it into a character string.

survival_time = eset$characteristics_ch1.7
survival_time = gsub('.*:', '', survival_time)
survival_time = as.character(survival_time)

The division of our outcome respect the batch groups can be resumed with the table function:

table(data.frame(Outcome = survival_time, Batch = days))
##         Batch
## Outcome  0 2 3 7 10 192
##    10.43 0 1 0 0  0   0
##    11.84 0 0 0 1  0   0
##    14.41 0 0 1 0  0   0
##    15.53 0 0 1 0  0   0
##    15.99 0 0 1 0  0   0
##    16.22 0 0 0 0  1   0
##    16.28 0 0 0 1  0   0
##    17.07 0 0 0 0  1   0
##    17.34 1 0 0 0  0   0
##    22.89 0 0 0 0  1   0
##    24.41 0 1 0 0  0   0
##    25.13 0 0 0 0  0   1
##    28.22 0 0 0 1  0   0
##    36.45 0 0 1 0  0   0
##    4.44  0 0 1 0  0   0
##    46.55 0 0 0 1  0   0
##    5.95  0 0 0 1  0   0
##    7.5   1 0 0 0  0   0
##    8.06  0 1 0 0  0   0
##    8.75  0 1 0 0  0   0
##    NA    0 0 0 0  1   0

Hierarchical Clustering

A nice way to see how effectively batch is confounding the outcome of interest consists of doing a hierarchical clustering of the samples indicating the batch to which each of them belongs to. We should start by calculating the distance between every pair of samples using a non-parametric association measure such as Spearman correlation:

d = as.dist(1 - cor(exprs(eset), method="spearman"))

Using hclust we perform the hierarchical clustering which can be easily understood, showing it in a dendrogram. We are trying to identify the batch effect (the batch to which each sample belongs is indicated by the colour of the label), and a differential clustering separating samples by survival time (indicated in the sample label).

# Hierarchical analysis:

sampleClustering = hclust(d)

# Function to generate the dendrogram:

sampleDendrogram = as.dendrogram(sampleClustering, hang=0.5)
batch = days
names(batch) = sampleNames(eset)
names(survival_time) = sampleNames(eset)
sampleDendrogram = dendrapply(sampleDendrogram, function(x, batch, labels) {
    ## for every node in the dendrogram if it is a leaf node
    if (is.leaf(x)) {
      print(x)
        attr(x, "nodePar") = list(lab.col = as.vector(ourcolors[batch_days[as.character(batch[attr(x, "label")]),]]))
        attr(x, "label") = as.vector(labels[attr(x, "label")])  ## label by survival_time
    }
    x
}, batch, survival_time)
## 'dendrogram' leaf 'GSM187180', at height 0.06152877 
## 'dendrogram' leaf 'GSM187213', at height 0 
## 'dendrogram' leaf 'GSM187158', at height 0 
## 'dendrogram' leaf 'GSM187222', at height 0 
## 'dendrogram' leaf 'GSM187206', at height 0 
## 'dendrogram' leaf 'GSM187192', at height 0 
## 'dendrogram' leaf 'GSM187214', at height 0 
## 'dendrogram' leaf 'GSM187161', at height 0 
## 'dendrogram' leaf 'GSM187184', at height 0 
## 'dendrogram' leaf 'GSM187188', at height 0 
## 'dendrogram' leaf 'GSM187229', at height 0 
## 'dendrogram' leaf 'GSM187178', at height 0 
## 'dendrogram' leaf 'GSM187211', at height 0 
## 'dendrogram' leaf 'GSM187175', at height 0 
## 'dendrogram' leaf 'GSM187208', at height 0 
## 'dendrogram' leaf 'GSM187234', at height 0 
## 'dendrogram' leaf 'GSM187163', at height 0 
## 'dendrogram' leaf 'GSM187232', at height 0 
## 'dendrogram' leaf 'GSM187185', at height 0 
## 'dendrogram' leaf 'GSM187225', at height 0.01668604 
## 'dendrogram' leaf 'GSM187186', at height 0.01668604
# And the required command to plot it:

plot(sampleDendrogram, main = "Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch))), fill = ourcolors)

In this dendrogram we can see that our samples do not cluster by batch, indicating that apparently in this case we don’t have batch effect. We would like to identify a separation by survival time, but it is clear that the different survivals are mixed. We don’t see, so far, a relationship among different survival times and expression.

Multidimensional Scaling

Another way to verify whether batch is the main source of variation is multidimensional scaling analysis. To perform it, we have to run the function cmdscale. The resulting cmd object have to be a matrix of dimension (n,2) where n=number of samples.

cmd = cmdscale(as.dist(1 - cor(exprs(eset), method = "spearman")))

dim(cmd)
## [1] 21  2
head(cmd)
##                   [,1]          [,2]
## GSM187180  0.075935177  0.0243188995
## GSM187213 -0.001443038 -0.0000049331
## GSM187178 -0.017200821 -0.0137915294
## GSM187175 -0.011023521 -0.0132489966
## GSM187232  0.016500092 -0.0032648222
## GSM187163  0.021697818 -0.0024718422
# Ploting the result
plot(cmd, type = "n")
text(cmd, survival_time, col = ourcolors[batch_days[as.character(batch),]], cex = 0.9)
legend("topleft", paste("Batch", unique(batch)), fill = ourcolors[batch_days[as.character(unique(batch)),]], inset = 0.01)

All the survival times are mixed well and we cannot detect any color group in the distribution meaning that there is no bacth effect that could confound a possible differential classification of the outcome.

Quantifying Confounding

The last test we are going to perform is the quantification of confounding between batch and outcome by means of a Principal Component Analysis test. It works rebuilding the data set in new variables (Principal Components) in terms of linear combinations, such that they capture most of the variance of the data. The purpose of applying PCA in this case is to compare the principal components that account for the largest variability of the data, to known variables such as a batch indicator.

We start the process computing the single value decomposition (SVD) of the data:

s = fast.svd(t(scale(t(exprs(eset)), center=TRUE, scale=TRUE)))

This value allows us to plot the fraction of variance explained by each principal component as follows:

plot(s$d^2/sum(s$d^2), type="b", lwd=2, las=1, xlab="Principal Component", ylab="Proportion of variance")

Using the function cumsum we can evaluate the amount variance accumulated with each principal component:

head(cumsum(s$d^2/sum(s$d^2)))
## [1] 0.1478175 0.2542525 0.3444673 0.4312466 0.4951665 0.5536456

In this case we explain more than 50% of variance considering 6 principal components.

Finally, in order to get an estimate the amount of variability driven by batch, we should inspect the correlations between our batch indicator variable and the right-singular vectors in the component v of s:

par(mfrow=c(2, 3))
for (i in 1:6) {
  boxplot(split(s$v[, i], batch), main=sprintf("PC%d %.0f%%", i, 100 * s$d[i]^2/sum(s$d^2)))
}

We can see the distribution, for each of the batch subgroups, of each of the six principal components. We can appreciate that the batch number 10 has a wider distribution across components 1, 2 and 3, while batch number 2 exhibits more variance in components 4 and 5. The results are, however, hard to interpret due to the low (and different) number of samples in each of the subgroups (eg. in the subgroup corresponding to day 192, we only have one sample), and we should not infer a batch effect from this analysis.

Surrogate Variable Analysis

With the SVA we try to identify sources of heterogeneity (eg. non-biological variability due to batch effects). With this analysis we should obtain an estimation of the surrogate variables and their values, which we can later use to adjust for undesirable effects.

library(sva)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-6. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
# We need the outcome variable in binary format, with a threshold of 20 months of survival
st_bin <- ifelse(survival_time >= 20, 1, 0)

# Define the full and null models
mod = model.matrix(~st_bin, data=pData(eset))
head(mod)
##           (Intercept) st_bin
## GSM187180           1      1
## GSM187213           1      1
## GSM187178           1      1
## GSM187175           1      1
## GSM187232           1      0
## GSM187163           1      1
mod0 <- model.matrix(~1, data = pData(eset))

# surrogate variables calling
sv <- sva(exprs(eset), mod, mod0)
## Number of significant surrogate variables is:  6 
## Iteration (out of 5 ):1  2  3  4  5
# Plot the correlations
par(mfrow = c(2, 3))
for (i in 1:sv$n.sv) boxplot(sv$sv[, i] ~ batch, main = sprintf("SV %d", i), xlab = "Batch")

The boxplots show almost no correlation of the batch indicator variable with every of the estimated surrogate variables.

The sva package also provides a function to quickly perform F-tests for detecting genes significantly changing between the full and null models. This enables a quick overview of the impact of adjusting for the estimated heterogeneity:

# Number of genes changing
pValues <- f.pvalue(exprs(eset), mod, mod0)
sum(p.adjust(pValues, method = "BH") < 0.05)
## [1] 0
dim(eset)
## Features  Samples 
##    54675       21
# Changes after adjustment
modSv <- cbind(mod, sv$sv)
mod0Sv <- cbind(mod0, sv$sv)
pValuesSv <- f.pvalue(exprs(eset), modSv, mod0Sv)
sum(p.adjust(pValuesSv, method = "BH") < 0.05)
## [1] 0

We do not see any gene with a significant change between the full and null model.

After the analysis conducted so far, we can assert that there is no batch effect in our samples. However, we have also seen that the samples corresponding to different survival times have clustered in a rather heterogeneous way, which might indicate a lack of relationship among gene expression and survival when we take into account all the groups at the same time.

Differential expression

With the aim to identify changes in gene expression associated to a specific condition, we perform a gene differential expression analysis. The condition in this analysis is the survivavility of the tumoral condition in patients treated with concomitant chemoradiotherapy(TMZ/RT->TMZ).

library(Biobase)
library(genefilter)
library(limma)
library(sva)
library('hgu133plus2.db')
annotation(eset)<-'hgu133plus2.db'

First we setup the input data in a format suitable for the analysis.

eset = eset[, eset$characteristics_ch1.6 != "survival status: NA"] # Remove control patients
eset$characteristics_ch1.1 = factor(eset$characteristics_ch1.1) # disease status: re-recurrent GBM
eset$characteristics_ch1.2 = factor(eset$characteristics_ch1.2) # patient id
eset$characteristics_ch1.3 = factor(eset$characteristics_ch1.3) # age: dd.d
eset$characteristics_ch1.4 = factor(eset$characteristics_ch1.4) # gender: MALE/FEMALE
eset$characteristics_ch1.5 = factor(eset$characteristics_ch1.5) # treatment: radiotherapy or TMZ/radiotherapy
eset$characteristics_ch1.6 = factor(eset$characteristics_ch1.6) # survival status: 0 or 1
eset$characteristics_ch1.7 = factor(eset$characteristics_ch1.7) # survival time in months: 25.13
eset$characteristics_ch1.8 = factor(eset$characteristics_ch1.8) # mgmt status: M or U
eset$mgmt = eset$characteristics_ch1.8
eset$treatment = eset$characteristics_ch1.5

eset$survival_status = as.double(gsub('.*: ', '', eset$characteristics_ch1.6)) == 1
eset$survival_status = factor(eset$survival_status)

# Transforming some variables to dichotomic
survival_threshold = 20
eset$survival_time_dichotomic = as.double(gsub('.*: ', '', eset$characteristics_ch1.7)) > survival_threshold
eset$survival_time_dichotomic = factor(eset$survival_time_dichotomic)

age_threshold = 50
eset$aged = as.double(gsub('.*: ', '', eset$characteristics_ch1.3)) > age_threshold
eset$aged = factor(eset$aged)

Fold-change

To evaluate the expression levels we calculate the fold-change expression for each gene, comparing the expression levels between the surviving and non-surviving groups.

zeroExp <- rowMeans(exprs(eset[, eset$survival_time_dichotomic]))
oneExp <- rowMeans(exprs(eset[, eset$survival_time_dichotomic != TRUE]))
par(mfrow = c(1, 2))
plot(zeroExp, oneExp, xlab = "Zero", ylab = "One", pch = ".", cex = 4, las = 1)
plot((oneExp + zeroExp)/2, oneExp - zeroExp, pch = ".", cex = 4, las = 1)

We can see that, in general, all the genes cluster together. We have certain points that deviate from the bulk, but not enough to imply an obvious differential expression. However, we can proceed with further analysis to verify this.

Significance

We need to assess the significance of the differentially expressed genes, applying a Benjamini-Hochberg multiple test correction.

allTests <- rowttests(eset, eset$survival_time_dichotomic)
min(allTests$p.value)
## [1] 8.3133e-06

We need to adjust for multiple testing.

# FDR adjustment
padjFDR <- p.adjust(allTests$p.value, method = "BH")
sum(padjFDR < 0.05)
## [1] 0
min(padjFDR)
## [1] 0.4545297

From this test it is clear that there are too many genes, and the multiple test correction renders all the tests insignificant due to the large number of tests. A possible solution to this problem consists in removing genes that are uninteresting biologically-wise.

First, we remove genes that show little variation. Following the procedure of the paper, we remove the genes that have a standard deviation below 0.75.

maskHighVariability <- apply(exprs(eset), 1, sd)>0.75
eset_filtered <- eset[maskHighVariability, ]

This reduces the dataset from 54675 genes down to 3557. Although there is an increase in significance with the lower amount of tested genes, it is still not enough, as seen on the minimal corrected p-value obtained.

fewerTests <- rowttests(eset_filtered, eset_filtered$survival_time_dichotomic)
padjFDR <- p.adjust(fewerTests$p.value, method = "BH")
sum(padjFDR < 0.05)
## [1] 0
min(padjFDR)
## [1] 0.1844852

We have seen how removing the genes with lower standard deviation have helped to obtain lower p-values by means of reducing the number of tests, but these are still far from the 0.05. To reduce even more the number of analysed probe sets we use the nsFilter function. This utility function filters out features exhibiting little variation, or a consistently low signal, across samples, as well as features with insufficient annotation.

filtered <- nsFilter(eset_filtered, require.entrez=TRUE,
         require.GOBP=FALSE, require.GOCC=FALSE,
         require.GOMF=FALSE, require.CytoBand=FALSE,
         remove.dupEntrez=TRUE, var.func=IQR,
         var.cutoff=0.5, var.filter=TRUE,
         filterByQuantile=TRUE, feature.exclude="^AFFX")
eset_filtered <- filtered$eset
fewerTests <- rowttests(eset_filtered, eset_filtered$survival_time_dichotomic)
padjFDR <- p.adjust(fewerTests$p.value, method = "BH")
sum(padjFDR < 0.05)
## [1] 0
min(padjFDR)
## [1] 0.5728377

The dataset is reduced to 1191 genes. As seen in the result, now the lowest p-value is higher than in the previous case. It is obvious that the corresponding gene was removed by the function nsFilter, maybe due to a low signal or a deficient annotation.

Moderated t-test using limma

To prevent problems associated to limited replication, we use the empirical Bayes method from the Bioconductor limma package to calculate a moderated t-test that takes into account the typical standard deviation of the genes.

We project our data to a linear model and calculate the moderated t-statistics.

design <- model.matrix(~survival_time_dichotomic, data = eset_filtered)
fit <- lmFit(eset_filtered, design)
fit <- eBayes(fit)
res <- decideTests(fit, p.value = 0.1)
summary(res)
##    (Intercept) survival_time_dichotomicTRUE
## -1           0                            0
## 0            0                         1191
## 1         1191                            0

We can see from the summary result that there are no significantly expressed genes. We can also check the results from the distribution of p-values for all the genes and the volcano plot.

tt <- topTable(fit, coef = 2, n = Inf)
par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "")
hist(tt$P.Value, xlab = "Raw P-values", breaks = 1000, main = "")

This distribution appears to be slightly negatively skewed (more density of p-values near 1). Adjusting for covariates we will be able to reduce this effect and obtain a more uniform distribution.

plot(tt$logFC, -log10(tt$P.Value), pch=".", cex=4, col=grey(0.75), cex.axis=1.2, las=1, cex.lab=1.5, xlab=expression(paste(log[2], " Fold change")), ylab=expression(paste(-log[10], " P-value")))
if (sum(tt$adj.P.Val < 0.1) > 0) {
  abline(h=-log10(max(tt[tt$adj.P.Val < 0.1, "P.Value"])), col=grey(0.5), lty=2)
}
points(tt[tt$adj.P.Val < 0.1, "logFC"], -log10(tt[tt$adj.P.Val < 0.1, "P.Value"]), pch=".", cex=4, col="red")

If we had p-values lower than 0.1, we would be able to see a line separating the differentially (above the line) and non-differentially (below the line) expressed genes. However, in this case our minimum p-value is higher than this threshold of 0.1.

Adjusting for covariates

We can adjust for the contribution of covariates. As proposed in the paper, we adjust for MGMT methylation status and for age>50.

eset_filtered$mgmtM = eset_filtered$mgmt == "mgmt status: M"

# Surrogate Variables
# adjusted for age (> 50 years) and MGMT methylation status
mod <- model.matrix(~survival_time_dichotomic + mgmtM + aged, data = eset_filtered)
mod0 <- model.matrix(~ mgmtM + aged, data = eset_filtered)

svaobj <- sva(exprs(eset_filtered), mod, mod0)
## Number of significant surrogate variables is:  4 
## Iteration (out of 5 ):1  2  3  4  5
modSVs <- cbind(mod, svaobj$sv)

fit <- lmFit(eset_filtered, modSVs)
fit <- eBayes(fit)
ttadj <- topTable(fit, coef = 2, n = Inf)

par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(ttadj$P.Value, xlab = "Raw P-values", main = "")
hist(ttadj$P.Value, xlab = "Raw P-values", breaks = 1000, main = "")

Now we can see that the distribution is much closer to uniform. We can do again a volcano plot to check if the significances have changed.

plot(ttadj$logFC, -log10(ttadj$P.Value), pch=".", cex=4, col=grey(0.75), cex.axis=1.2, las=1, cex.lab=1.5, xlab=expression(paste(log[2], " Fold change")), ylab=expression(paste(-log[10], " P-value")))
if (sum(ttadj$adj.P.Val < 0.1) > 0) {
  abline(h=-log10(max(ttadj[ttadj$adj.P.Val < 0.1, "P.Value"])), col=grey(0.5), lty=2)
}
points(ttadj[ttadj$adj.P.Val < 0.1, "logFC"], -log10(ttadj[ttadj$adj.P.Val < 0.1, "P.Value"]), pch=".", cex=4, col="red")

Still, we have no significant genes.

Conclusion

During this analysis we have explored the data provided by Murat, A. et al. (2008). We have performed a quality assessment, ensuring that the samples have enough quality to proceed with the analysis. In addition, we have tried to identify sources of batch effect, concluding that these samples are not affected by this type of technical bias. Finally, we have made a first attempt to detect differentially expressed genes on two groups of samples: those corresponding to patients with a survival above and below 20 months.

The analysis of differential expression has revealed no significant genes. However, we have made the analysis jointly for both TMZ/radiotherapy and radiotherapy sample groups. One possible explanation for the lack of significant results is that the genes affecting survival time are different in each kind of treatment. This suggest that the next step is to analyze separately both sample groups.

Finally, the analysis driven by Murat, A. et al. includes a process of gene clustering using CTWC (see their gene clusters here), leading to a reduced set that is key to obtain statistically significant results. In our case we have tested each gene separately, an approach with a much lower statistical power.

Future steps

A similar analysis could be performed analyzing the differential expression separately for TMZ/radiotherapy and radiotherapy samples. Moreover, we could do the analysis including all of the 81 samples, instead of using a reduced set of just 21 samples, which might reveal more significant genes.

In addition, as each samples has several phenotipical characteristics, it might be interesting to perform the previous analysis focusing on a variable different from the survival time.

More importantly, we could study the possibility to use a clustering technique to group related genes in order to improve the statistical significance, as we will be considering biological and functional aspects.

Session information

sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hgu133plus2.db_3.1.2  org.Hs.eg.db_3.1.2    RSQLite_1.0.0        
##  [4] DBI_0.3.1             AnnotationDbi_1.30.1  GenomeInfoDb_1.4.0   
##  [7] IRanges_2.2.1         S4Vectors_0.6.0       limma_3.24.3         
## [10] sva_3.14.0            genefilter_1.50.0     mgcv_1.8-6           
## [13] nlme_3.1-120          corpcor_1.6.7         hgu133plus2cdf_2.16.0
## [16] affyPLM_1.44.0        preprocessCore_1.30.0 gcrma_2.40.0         
## [19] GEOquery_2.34.0       affy_1.46.0           Biobase_2.28.0       
## [22] BiocGenerics_0.14.0  
## 
## loaded via a namespace (and not attached):
##  [1] BiocInstaller_1.18.1 formatR_1.2          XVector_0.8.0       
##  [4] bitops_1.0-6         tools_3.2.0          zlibbioc_1.14.0     
##  [7] digest_0.6.8         annotate_1.46.0      evaluate_0.7        
## [10] lattice_0.20-31      Matrix_1.2-0         yaml_2.1.13         
## [13] stringr_1.0.0        knitr_1.10.5         Biostrings_2.36.1   
## [16] grid_3.2.0           survival_2.38-1      XML_3.98-1.1        
## [19] rmarkdown_0.6.1      magrittr_1.5         htmltools_0.2.6     
## [22] splines_3.2.0        xtable_1.7-4         KernSmooth_2.23-14  
## [25] stringi_0.4-1        RCurl_1.95-4.6       affyio_1.36.0